In this lab, we’ll use some of the functions and methods we’ve looked at previously to carry out some simple data analysis for a couple of datasets. You’ll need the following files for this lab (all should be available on the Google drive):
In the first example, we’ll explore variations in median income in Georgia at the county level. This example is modified from Alex Comber’s GEOG 3915 GeoComputation class. We’ll start with some simple exploration and then move on to running some simple statistical analysis. First load the libraries that we’ll need (these should all have been installed in previous labs):
library(tidyverse)
library(sf)
library(tmap)
Next, we’ll read in the data. All the information we need is held in
the georgia shapefile, so load this with
st_read():
georgia <- st_read("./data/georgia/georgia.shp", quiet = TRUE)
This shapefile contains a number of variables for the counties, including the percentage of the population in each county that:
- lives in a rural area (PctRural)
- has a bachelor's degree (PctBach)
- is elderly (PctEld)
- is foreign born (PctFB)
- lives below the poverty line (PctPov)
- is Black (PctBlack)

and the median income of the county (MedInc, in dollars).
Check to make sure that the coordinate reference system is set:
st_crs(georgia)
The variable we are interested in is the median income
(MedInc). Remember you can check the column names with
names() or the data structure with str().
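For example, a quick check of the attribute columns and the structure of the georgia object (output not shown here) might look like this:
names(georgia)
str(georgia)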
Let’s map this out to see if there’s a pattern:
tm_shape(georgia) +
tm_fill("MedInc", palette = 'viridis') +
tm_layout(legend.outside = TRUE)
There’s a reasonably strong N-S gradient in income values, with higher values around Atlanta in the north and Savannah on the southeastern coast. The values are in $/yr; we’ll rescale them here to make them a little more manageable:
georgia <- georgia %>%
mutate(MedInc000 = MedInc / 1000)
Let’s get some summary statistics on this variable:
summary(georgia$MedInc000)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 23.46 29.77 34.31 37.15 41.20 81.60
And make a histogram showing the distribution:
ggplot(georgia, aes(x = MedInc000)) +
geom_histogram(col = 'lightgray', fill = 'darkorange') +
scale_x_continuous("Median Income $000s") +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The distribution of median income is right-skewed: most of the values fall between about $22K and $40K, but with a few higher values. Normally, we would log-transform these values before working with them further, as this reduces the skew. We’ll skip that here to make some of the results more interpretable, but we can see what difference this would make by changing the x-axis to a log scale:
ggplot(georgia, aes(x = MedInc000)) +
geom_histogram(col = 'lightgray', fill = 'darkorange') +
scale_x_log10("Median Income $000s") +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
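If you did want to work with log-transformed values later on, a minimal sketch would be the following (the new column name MedInc_log is just illustrative):
georgia <- georgia %>%
mutate(MedInc_log = log10(MedInc000))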
Very briefly, we’ll also demonstrate a simple form of spatial
exploration with hotspot analysis. This is based on the Getis-Ord \(G\) statistic, and indicates if there are
parts of a region that have much higher (hotspot), or much lower
(coldspot), values than might be expected. If you want to run this, you
will need to install the spdep library, and then carry out
the following steps:
- poly2nb - this identifies which spatial locations are considered to be neighbors of each other
- nb2listw - converts this neighbor list into a numeric (weighted) representation of the neighborhoods
- localG - calculates the local Getis-Ord G statistic for each location

library(spdep)
## Loading required package: spData
## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source')`
nb <- poly2nb(georgia)
lw <- nb2listw(nb)
georgia_localG <- localG(georgia$MedInc000,
listw = lw)
Once run, we can plot the map, which not too surprisingly shows hotspots around Atlanta and on the coast.
georgia$lG <- georgia_localG
tm_shape(georgia) +
tm_fill("lG", palette = "-RdBu") +
tm_borders('lightgray') +
tm_layout(main.title = "Getis Ord G Values")
## Variable(s) "lG" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
R comes with a large number of statistical tests. We’ll just look
here at the simplest, the \(t\)-test.
The format of the output is broadly similar for most of these tests, so
the interpretation we make here should transfer to other tests. The
\(t\)-test is a test for a significant
difference of means between two groups. We’ll create two groups
using the PctBach variable (the percent of the population
in each county with a higher education degree). We’ll split the counties
by whether they are above the median PctBach value for the
state:
median(georgia$PctBach)
## [1] 9.4
We’ll use a combination of mutate and the
ifelse function to create a new, binary variable (using a threshold of 10, just above the median):
georgia <- georgia %>%
mutate(higher_ed = ifelse(PctBach > 10, "High", "Low"))
And we can use boxplots to examine the difference in median income for these two groups:
ggplot(georgia, aes(x = higher_ed, y = MedInc000)) +
geom_boxplot() +
scale_y_continuous("Med Inc $000s") +
theme_bw()
We’ll now run a \(t\)-test. For
this, we need to define the variable containing the groups
(higher_ed), and the variable we want to run the test on
(MedInc000).
We define this using R’s formula syntax. This syntax is used across a
lot of testing and modeling functions and is written as y ~ x,
where y is the dependent variable (the one we want
to test) and x is the independent variable (the one we use
to run the test). We’ll see this again in the next section:
t.test(MedInc000 ~ higher_ed, georgia)
##
## Welch Two Sample t-test
##
## data: MedInc000 by higher_ed
## t = 5.2655, df = 98.429, p-value = 8.219e-07
## alternative hypothesis: true difference in means between group High and group Low is not equal to 0
## 95 percent confidence interval:
## 5.467864 12.081624
## sample estimates:
## mean in group High mean in group Low
## 42.22430 33.44955
This test gives a lot of output, but the main values are:
- the test statistic (t): this is used in running the test (here, a standardized difference of means)
- the p-value: the probability of seeing a difference this large if there were really no difference between the groups

When \(t\) is high and the \(p\)-value is low (\(<0.05\)), as here, this indicates a significant difference between the two groups.
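The result of t.test() is also an R object, so if you want to pull out individual values rather than reading them from the printed output, a small sketch is:
tt <- t.test(MedInc000 ~ higher_ed, georgia)
tt$p.value
tt$conf.int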
Next, we’ll build a simple regression model of the income values
using the original values of percent higher education in each county.
I’m sure you’re familiar with this, but as a quick recap, this models
the changes in a dependent variable as a function of an intercept (\(\beta_0\)) and a slope (\(\beta_1\)) times a covariate
(PctBach):
\[ y = \beta_0 + \beta_1 x + e \]
The lm() function is used in R to build simple linear
regression models. This again uses the formula syntax described above,
and we need to specify the R object that contains the variables for the
model (georgia). We’ll fit this now, and use the
summary() function to look at the results:
fit_lm1 <- lm(MedInc000 ~ PctBach, data = georgia)
summary(fit_lm1)
##
## Call:
## lm(formula = MedInc000 ~ PctBach, data = georgia)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.124 -6.025 -1.905 3.416 39.969
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.5913 1.5476 17.182 < 2e-16 ***
## PctBach 0.9643 0.1255 7.684 1.57e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.986 on 157 degrees of freedom
## Multiple R-squared: 0.2733, Adjusted R-squared: 0.2687
## F-statistic: 59.04 on 1 and 157 DF, p-value: 1.566e-12
There’s a lot of output here, but the most important parts are the coefficient estimates (the intercept and slope), their \(p\)-values, and the R\(^2\) value.
As a very quick interpretation, we have an intercept of 26.6 and a slope of about 0.96. A slope this close to 1 suggests that median income increases by approximately $1,000 for every one percent increase in higher education. The R\(^2\) is around 0.27, indicating that 27% of the variation in income is explained by our model.
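If you want to extract these numbers from the fitted model directly, rather than reading them from the printed summary, a quick sketch is:
coef(fit_lm1)
summary(fit_lm1)$r.squared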
We can visualize this model by using ggplot2’s
geom_smooth() function to add a regression line to a
scatter plot:
ggplot(georgia, aes(x = PctBach, y = MedInc000)) +
geom_point() +
geom_smooth(method = "lm") +
theme_bw()
## `geom_smooth()` using formula = 'y ~ x'
(This shows, again, the uneven distribution of values.)
You can add more covariates to the model by extending the right-hand side of the formula. Here we’ll add the percent elderly and percent foreign born to explore their relationship with income:
fit_lm2 <- lm(MedInc000 ~ PctBach + PctEld + PctFB, data = georgia)
summary(fit_lm2)
##
## Call:
## lm(formula = MedInc000 ~ PctBach + PctEld + PctFB, data = georgia)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.979 -4.789 -0.922 2.712 34.805
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 51.5661 3.5282 14.616 < 2e-16 ***
## PctBach 0.7980 0.1479 5.397 2.49e-07 ***
## PctEld -1.7890 0.2312 -7.738 1.21e-12 ***
## PctFB -1.9020 0.6914 -2.751 0.00665 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.672 on 155 degrees of freedom
## Multiple R-squared: 0.4771, Adjusted R-squared: 0.4669
## F-statistic: 47.13 on 3 and 155 DF, p-value: < 2.2e-16
The new covariates have negative coefficients, indicating that income declines as these increase in Georgia. The R\(^2\) has increased to approximately 0.47.
Regression models (and most other models) are aspatial and are commonly referred to as global models, as the model coefficients are assumed to hold true everywhere. In reality, this assumption of spatial invariance is violated in many instances when geographic phenomena are considered. For example, the association between PctBach and MedInc might be different in different places. To assess this, we can use geographically weighted regression or GWR. This estimates a series of local models, one per spatial location. Each model is built on a subset of the data: the set of locations in a window around the location of interest. As a result, the coefficients can vary in space. The choice of window size is important, as it dictates the number of observations used in each local model, and so the quality of that model. While several papers have criticized this approach as a complete modeling method, it is very useful for exploring potential variation in the relationship between dependent and independent variables.
We now build the GWR model using the spgwr package. Building a GWR model requires two steps: the first to assess the best window size, and the second to build and diagnose the local models using this window. The window can be chosen as either a fixed distance, or an adaptive bandwidth that includes a fixed proportion of the nearest observations.
Here, we will use the second method, by setting the parameter
adapt = TRUE in the GWR function. We first need to extract
polygon centroids for use in the distance calculations (these will be
used to select the locations within a window around a point of
interest).
library(spgwr)
georgia_crds = st_coordinates(st_centroid(georgia))
plot(st_geometry(georgia))
points(georgia_crds, pch = 16)
The gwr.sel function can be used to work out the optimum
window size. It does this by removing a subset of the locations and
testing how well the remaining locations can predict income for the
subset. This is iterated across a set of bandwidths until the optimum is
found.
gwr_bw = gwr.sel(MedInc000 ~ PctBach + PctEld + PctFB,
coords = georgia_crds, data = georgia, adapt = TRUE)
## Adaptive q: 0.381966 CV score: 8454.503
## Adaptive q: 0.618034 CV score: 9109.154
## Adaptive q: 0.236068 CV score: 7977.137
## Adaptive q: 0.145898 CV score: 7762.359
## Adaptive q: 0.09016994 CV score: 7714.123
## Adaptive q: 0.07639362 CV score: 7673.659
## Adaptive q: 0.04721385 CV score: 7913.753
## Adaptive q: 0.06524794 CV score: 7616.704
## Adaptive q: 0.05835953 CV score: 7715.613
## Adaptive q: 0.06950521 CV score: 7596.978
## Adaptive q: 0.06901448 CV score: 7594.047
## Adaptive q: 0.0681993 CV score: 7599.133
## Adaptive q: 0.06894048 CV score: 7594.512
## Adaptive q: 0.06912234 CV score: 7593.367
## Adaptive q: 0.06926858 CV score: 7594.05
## Adaptive q: 0.06916303 CV score: 7593.11
## Adaptive q: 0.06920372 CV score: 7593.25
## Adaptive q: 0.06916303 CV score: 7593.11
gwr_bw
## [1] 0.06916303
As this is an adaptive bandwidth, the optimum value (0.069) indicates the proportion of counties that are included in each local model (about 7%). We can now use this to estimate the full set of local models (one per county):
fit_gwr = gwr(MedInc000 ~ PctBach + PctEld + PctFB,
coords = georgia_crds, data = georgia, adapt = gwr_bw)
fit_gwr
## Call:
## gwr(formula = MedInc000 ~ PctBach + PctEld + PctFB, data = georgia,
## coords = georgia_crds, adapt = gwr_bw)
## Kernel function: gwr.Gauss
## Adaptive quantile: 0.06916303 (about 10 of 159 data points)
## Summary of GWR coefficient estimates at data points:
## Min. 1st Qu. Median 3rd Qu. Max. Global
## X.Intercept. 38.933286 51.703499 56.530763 63.546895 78.252407 51.566
## PctBach -0.095338 0.358524 0.554166 0.884791 1.508898 0.798
## PctEld -3.265517 -2.397512 -2.050739 -1.707752 -1.117719 -1.789
## PctFB -6.921809 -3.515077 -1.845325 -0.649226 1.331465 -1.902
The output of the model now contains a range of coefficient estimates
for each variable showing how much these vary across Georgia. We can
also visualize this, by extracting the values for each county. These are
held in an object called SDF in the output of the
gwr() function:
fit_gwr$SDF
## class : SpatialPointsDataFrame
## features : 159
## extent : 949905.4, 1392468, -674547.7, -217623.8 (xmin, xmax, ymin, ymax)
## crs : NA
## variables : 8
## names : sum.w, (Intercept), PctBach, PctEld, PctFB, gwr.e, pred, localR2
## min values : 15.0109625667414, 38.9332861313754, -0.0953381517537397, -3.26551722658842, -6.92180900589457, -16.9322238406029, 20.2667549521815, 0.553666131713563
## max values : 29.0794134535373, 78.2524067679104, 1.50889803811775, -1.11771940077526, 1.33146450846935, 27.466910234057, 64.731234060467, 0.765569568212537
We can attach these back to the georgia sf object and
make some maps:
georgia$b_PctBach = fit_gwr$SDF$PctBach
georgia$b_PctEld = fit_gwr$SDF$PctEld
georgia$b_PctFB = fit_gwr$SDF$PctFB
georgia$localR2 = fit_gwr$SDF$localR2
The last line extracts the local R\(^2\) (the R\(^2\) value for each local model). Now let’s map some of these values:
tm_shape(georgia) +
tm_fill( "localR2", palette = "viridis", style = "kmeans") +
tm_layout(legend.position = c("right","top"), frame = F)
tm_shape(georgia) +
tm_fill( "b_PctEld", palette = "viridis", style = "kmeans") +
tm_layout(legend.position = c("right","top"), frame = F)
This suggests that the relationship between percent elderly and income is much stronger around the metropolitan regions of the state.
tm_shape(georgia) +
tm_fill(c("b_PctFB", "b_PctBach"),midpoint = 0, style = "kmeans") +
tm_style("col_blind")+
tm_layout(legend.position = c("right","top"), frame = F)
Machine learning is a set of algorithms and methods that can identify patterns in data and predict from these patterns. This has a number of uses with spatial data, including predicting land cover or species occurrences, or with remote sensing data for segmentation and classification. Here, we’ll use a set of landslide occurrences from southern Ecuador to make a map of landslide risk. We’ll use a machine learning approach to examine which covariates are linked with landslides or absences of landslides, and then use a raster of these values to produce a full risk map. This example is modified from Robin Lovelace’s Geocomputation with R book.
You’ll need the following files to run this lab, so make sure you have them downloaded from the Google drive.
You’ll also need to install a package to help with machine learning
(mlr3). This is a meta-package, and is used to
help streamline the machine learning process by providing a standard
interface to a large number of algorithms, as well as functions to test
models. This comes as a set of packages, so the easiest is to install
the whole set with install.packages("mlr3verse"). We’ll
also be using the pROC package, so install that as
well. Once installed, load the libraries we’ll need (sf and tmap were loaded in the first example, but load them here too if you’re starting a fresh session):
library(terra)
library(dplyr)
library(sf)
library(tmap)
library(pROC)
library(mlr3verse)
Next, we’ll load the landslide occurrences:
lsl <- read.csv("./data/lsl.csv")
head(lsl)
## x y lslpts slope cplan cprof elev
## 1 713887.7 9558537 FALSE 33.75185 0.023180449 0.003193061 2422.810
## 2 712787.7 9558917 FALSE 39.40821 -0.038638908 -0.017187813 2051.771
## 3 713407.7 9560307 FALSE 37.45409 -0.013329108 0.009671087 1957.832
## 4 714887.7 9560237 FALSE 31.49607 0.040931452 0.005888638 1968.621
## 5 715247.7 9557117 FALSE 44.07456 0.009686948 0.005149810 3007.774
## 6 714927.7 9560777 FALSE 29.85981 -0.009047707 -0.005738329 1736.887
## log10_carea
## 1 2.784319
## 2 4.146013
## 3 3.643556
## 4 2.268703
## 5 3.003426
## 6 3.174073
This has a set of variables derived from a DEM that we’ll use to build the model:
- slope: slope angle (degrees)
- cplan: plan curvature (rad m\(^{-1}\)) expressing the convergence or divergence of a slope and thus water flow
- cprof: profile curvature (rad m\(^{-1}\)) as a measure of flow acceleration, also known as downslope change in slope angle
- elev: elevation (m a.s.l.) as the representation of different altitudinal zones of vegetation and precipitation in the study area
- log10_carea: the decadic logarithm of the catchment area (log10 m\(^2\)) representing the amount of water flowing towards a location

As we need to be able to contrast where landslides have occurred,
with where they have not, the file contains a column lslpts
that indicates for each location if there was a landslide
(TRUE) or if it is a background point without a landslide
(FALSE). We will need to make this a factor -
this is a particular R data type used to differentiate categorical
data.
lsl <- lsl %>%
mutate(lslpts_f = factor(lslpts))
We’ll also use this dataframe to create an sf object to
see the locations:
lsl_sf <- st_as_sf(lsl, coords = c("x", "y"),
crs = 32717)
Next, we’ll read in the mask data, and map the locations within the mask:
lsl_mask <- st_read("./data/mask/study_mask.shp")
## Reading layer `study_mask' from data source
## `/Users/u0784726/Dropbox/Data/devtools/ugic-intro-r/data/mask/study_mask.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 1 feature and 1 field
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 712112.8 ymin: 9556864 xmax: 715794 ymax: 9560905
## Projected CRS: WGS 84 / UTM zone 17S
tm_shape(lsl_mask) +
tm_borders() +
tm_shape(lsl_sf) +
tm_symbols(col = "lslpts", size = 0.75, alpha = 0.7)
You should be able to see that the landslides (yellow) tend to cluster in the center of the study region.
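If you want to check how many landslide and background points the dataset contains, a quick count (output not shown here) is:
table(lsl$lslpts)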
Next, read in the raster data using terra. This has
raster layers with the same variables in the lsl object.
Note that it is important that the names of these layers match the
column names in the dataframe exactly for prediction. We’ll also use the
mask to crop out the study area:
ta = rast("./data/ta.tif")
ta = mask(ta, mask = lsl_mask)
plot(ta)
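As a quick sanity check (not part of the original workflow) that the raster layer names really do match the dataframe columns used in the model, you could run:
names(ta)
all(names(ta) %in% names(lsl))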
If we add the locations to the elevation layer, you can see that the landslide cluster tends to occur at mid-elevations.
tm_shape(ta['elev']) +
tm_raster(palette = '-Greens', style = 'cont') +
tm_shape(lsl_mask) +
tm_borders() +
tm_shape(lsl_sf) +
tm_symbols(col = "lslpts", size = 0.5, alpha = 0.7) +
tm_layout(legend.outside = TRUE)
This is an optional section. In this dataset, the terrain
characteristics (slope, aspect, etc) were precalculated. The
terra package has a function to estimate these
(terrain) from a DEM. The following code estimates the
slope and aspect in radians, then uses the shade() function
to create a hill shading map. This has arguments to position the light
source (angle sets the elevation of the light source above the horizon, and direction sets its azimuth):
slope <- terrain(ta['elev'], "slope", unit = 'radians')
aspect <- terrain(ta['elev'], "aspect", unit = 'radians')
hs <- shade(slope, aspect, angle = 45, direction = 90)
We can then use this as a base map in tmap, and
overlay another raster layer with a transparency
(alpha):
tm_shape(hs) +
tm_raster(palette = "Greys", legend.show = FALSE) +
tm_shape(ta['elev']) +
tm_raster(palette = '-Greens', alpha = 0.35, legend.show = FALSE)
We’ll start by building two relatively simple models to predict landslide risk. We’ll use a logistic regression model (similar to the linear model above, but designed for binary outcomes), and a random forest, a widely used machine learning method. For each of these, we’ll do three things:
- build (fit) the model using the landslide point data
- test how well it predicts landslides at the observed locations
- predict across the ta raster object to get a map of risk

Logistic models can be fit in R using the glm()
function. This is designed for regression models that don’t have
continuous outcomes (in contrast to lm() that we used
above). As before, we need to define:
- the model formula (the outcome and the covariates, using the formula syntax)
- the dataset containing these variables (data)
We also need to define a family. For binary outcomes,
this needs to be set to binomial:
fit_glm <- glm(lslpts ~ slope + cplan + cprof + elev + log10_carea,
family = binomial(),
data = lsl)
If you’d like to see the output of the model (e.g. coefficients), run
summary(fit_glm). I’m skipping this step here.
Next, we’ll test the prediction using the AUROC. This stands for Area
Under the Receiver Operating Characteristic curve. The full explanation
is a little complex, but essentially we use the model to predict a risk
value for each of the observed points with fitted(). The
AUROC compares these to the observed presence or absence of landslides.
AUROC values can range from 0 to 1, but we really need values above 0.5
to indicate a reasonable prediction (a value of 0.5 indicates that the
model is no better than flipping a coin!).
First, get predicted values for each location:
lsl_pred <- fitted(fit_glm)
head(lsl_pred)
## 1 2 3 4 5 6
## 0.19010373 0.11718500 0.09519487 0.25030946 0.33819463 0.15754547
You’ll see here that the predictions are made as a probability (i.e. in the range [0-1]). Now, let’s estimate the AUROC:
auc(roc(lsl_sf$lslpts, lsl_pred))
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls < cases
## Area under the curve: 0.8216
Giving a value of about 0.82, which is pretty decent skill. Now,
let’s predict across the raster to get our risk map. A note here: we
want to use the predict() function from the
terra package, as this will return the predictions as a
raster layer. To force this, we prepend the name of the package
(terra::) before the function.
pred <- terra::predict(ta, model = fit_glm, type = "response")
Finally, we can make our map:
m_glm <- tm_shape(pred) +
tm_raster() +
tm_shape(lsl_sf) +
tm_dots(size = 0.2) +
tm_layout(main.title = "GLM")
m_glm
Darker shades on this map indicate areas with higher risks.
Now let’s repeat these steps with a random forest. There are several packages for this, but we’ll use the randomForest package, which is the oldest and most stable. You should see that while we are using a different package and function to build the model, the formula syntax is exactly the same as before. You’ll probably get a warning, which we’ll ignore for now.
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
fit_rf <- randomForest(lslpts ~ slope + cplan + cprof + elev + log10_carea,
data = lsl)
Now estimate the AUROC. We get a small improvement here, from 0.82 to about 0.85.
lsl_pred <- predict(fit_rf)
auc(roc(lsl_sf$lslpts, lsl_pred))
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls < cases
## Area under the curve: 0.8485
And predict on the raster:
pred = terra::predict(ta, model = fit_rf)
m_rf <- tm_shape(pred) +
tm_raster() +
tm_shape(lsl_sf) +
tm_dots(size = 0.2) +
tm_layout(main.title = "Random Forest")
m_rf
tmap_arrange(m_glm, m_rf)
In the previous section, we built and tested two models. However, the AUROC values that we used to compare them are not a good estimate of predictive skill. The reason for this is that we used the same set of data to build and then test each model, so the test is not an independent assessment. To get around this, we can use a cross-validation approach to model testing. In this, the dataset is split into two: a training set used to build the model, and a testing set used to, well, test it. As the test set is not used to train or build the model, it is considered independent (the model does not ‘see’ these data), and the resulting test is a better estimate of predictive skill.
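To make this concrete, here is a minimal sketch of a single train/test split done by hand, using the logistic model from above (the 80/20 split and the set.seed value are arbitrary choices for illustration):
set.seed(42)
train_idx <- sample(nrow(lsl), size = round(0.8 * nrow(lsl)))
lsl_train <- lsl[train_idx, ]
lsl_test <- lsl[-train_idx, ]
fit_cv <- glm(lslpts ~ slope + cplan + cprof + elev + log10_carea,
family = binomial(), data = lsl_train)
test_pred <- predict(fit_cv, newdata = lsl_test, type = "response")
auc(roc(lsl_test$lslpts, test_pred))
Note that a single split like this is sensitive to which points happen to end up in the test set, which is exactly why k-fold cross-validation repeats the split.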
While you can set up and run cross-validation by hand, it’s generally easier to do this with a helper package. We’ll use mlr3, which you installed and loaded earlier. We need to define a series of things to make this work:
task

This defines the dataset, the outcome or target variable, and the
independent variables or features used to model the outcome. This needs, at
a minimum:
- The backend: the dataframe or object holding the data
- The target: the variable in the backend that we want to predict
The function to create this is TaskClassif. This is
designed for binary or categorical outcomes. There is also a
TaskRegr for continuous outcomes.
lsl_task <- TaskClassif$new(id = "lsl", backend = lsl,
target = "lslpts_f")
We can check how this is set up (i.e. what is the target and the features/variables):
lsl_task$col_roles
## $feature
## [1] "cplan" "cprof" "elev" "log10_carea" "lslpts"
## [6] "slope" "x" "y"
##
## $target
## [1] "lslpts_f"
##
## $name
## character(0)
##
## $order
## character(0)
##
## $stratum
## character(0)
##
## $group
## character(0)
##
## $weight
## character(0)
##
## $always_included
## character(0)
Note that this includes the coordinates and the original landslide labels, so we’ll exclude these so they are not used in the model:
lsl_task$col_roles$feature <- setdiff(lsl_task$col_roles$feature,
c("x", "y", "lslpts"))
lsl_task$feature_names
## [1] "cplan" "cprof" "elev" "log10_carea" "slope"
measure

This is the score that is used to assess how well the model predicts, and
is defined using msr(). We’ll use two here: the AUROC (classif.auc)
and the accuracy (classif.acc; the proportion of points that are correctly
predicted):
msr_auc = msr("classif.auc")
msr_acc = msr("classif.acc")
If you want to see the full set of measures, just run the following code:
mlr_measures
resampling

This is the cross-validation strategy: the way in which the data will
be split into training and testing sets, and is defined using
rsmp(). We’ll use \(k\)-fold cross-validation (cv).
This splits the data \(k\) times. So
for a 5-fold CV, the data is split into 80% training and 20% testing,
and then this is repeated 4 more times. This ensures that every data point
is used exactly once in the testing set and gives an exhaustive evaluation.
rsmp_cv = rsmp("cv", folds = 5)
If you want to see the full set of resampling strategies, just run the following code:
mlr_resamplings
learner

This is the machine learning algorithm to be used, and is defined using
lrn(). We’ll use a logistic model again
(classif.log_reg):
lrn_glm = lrn("classif.log_reg",
predict_type = "prob")
If you want to see the full set of learners, just run the following code:
mlr_learners
Now that all this is defined, we can run the cross-validation and
get the results. The function to do this is called
resample(), and it has arguments for the task, the learner and
the cross-validation strategy.
rr_glm = resample(lsl_task,
lrn_glm,
rsmp_cv,
store_models = TRUE)
## INFO [17:23:58.413] [mlr3] Applying learner 'classif.log_reg' on task 'lsl' (iter 1/5)
## INFO [17:23:58.437] [mlr3] Applying learner 'classif.log_reg' on task 'lsl' (iter 2/5)
## INFO [17:23:58.446] [mlr3] Applying learner 'classif.log_reg' on task 'lsl' (iter 3/5)
## INFO [17:23:58.451] [mlr3] Applying learner 'classif.log_reg' on task 'lsl' (iter 4/5)
## INFO [17:23:58.457] [mlr3] Applying learner 'classif.log_reg' on task 'lsl' (iter 5/5)
This should run pretty quickly, and all the results are stored in
rr_glm. We asked the function to store the individual
models, and you can access them (if you need to) with
rr_glm$learners.
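For example, assuming the stored model for each fold is the underlying glm fit, you could inspect the coefficients from the first fold with something like:
coef(rr_glm$learners[[1]]$model)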
You can now get the average AUROC and accuracy for the 5 models with:
rr_glm$aggregate(c(msr_auc, msr_acc))
## classif.auc classif.acc
## 0.8218499 0.7400000
You can also see the values for the individual folds with:
rr_glm$score(msr_auc)
## task_id learner_id resampling_id iteration classif.auc
## <char> <char> <char> <int> <num>
## 1: lsl classif.log_reg cv 1 0.7624898
## 2: lsl classif.log_reg cv 2 0.8026316
## 3: lsl classif.log_reg cv 3 0.8273026
## 4: lsl classif.log_reg cv 4 0.8503289
## 5: lsl classif.log_reg cv 5 0.8664966
## Hidden columns: task, learner, resampling, prediction
This all probably seems like a lot of effort to get to this point, but now that we have everything set up, it is very easy to test different machine learning algorithms. All we have to do is define a new learner (here a random forest), re-run the resampling and check the resulting scores:
lrn_rf = lrn("classif.ranger",
predict_type = "prob",
importance = "permutation")
rr_rf = resample(lsl_task,
lrn_rf,
rsmp_cv,
store_models = TRUE)
## INFO [17:23:58.560] [mlr3] Applying learner 'classif.ranger' on task 'lsl' (iter 1/5)
## INFO [17:23:58.643] [mlr3] Applying learner 'classif.ranger' on task 'lsl' (iter 2/5)
## INFO [17:23:58.706] [mlr3] Applying learner 'classif.ranger' on task 'lsl' (iter 3/5)
## INFO [17:23:58.772] [mlr3] Applying learner 'classif.ranger' on task 'lsl' (iter 4/5)
## INFO [17:23:58.833] [mlr3] Applying learner 'classif.ranger' on task 'lsl' (iter 5/5)
rr_rf$aggregate(c(msr_auc, msr_acc))
## classif.auc classif.acc
## 0.8637583 0.7828571
Our final AUROC score for the random forest is 0.864. These scores are overall similar to the results from the non-independent test earlier, but are now a much more robust estimate of predictive skill.
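As an aside, we set importance = "permutation" when defining the ranger learner, so the stored models also carry variable importance scores. Assuming the mlr3 ranger learner exposes these through its importance() method, you could inspect them for the first fold with something like:
rr_rf$learners[[1]]$importance()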
These two exercises have introduced some basic statistical and machine learning methods in R, but they barely scratch the surface of what is possible. There are many other methods and approaches out there. Here’s a very short list of resources that may be helpful: